#### **University of Southern California**

#### Viterbi School of Engineering

#### **EE352**

#### **Computer Organization and Architecture**

#### Cache:

Definitions

Address Mapping

Performance

#### References:

- 1) Textbook
- 2) Mark Redekopp's slide series

#### **Shahin Nazarian**

## What is Cache Memory?

- Cache memory is a small, fast memory used to hold copies of data that the processor will likely need to access in the near future
- Cache sits between the processor and main memory (MM) and is usually built onto same chip as processor
- Read and write requests will hopefully be satisfied by the fast cache rather than the slow main memory



Usual chip boundary

#### Motivation for Cache Memory

- Large memories are inherently slow
  - We need a large memory to hold code and data from multiple applications
- Small memory is inherently faster
  - Important Fact: Processor is only accessing a small fraction of code and data in any short time period
- Use both!
  - Large memory as a global store and cache as a smaller "working-set" store

#### Memory Hierarchy

 Memory hierarchy provides ability to access data quickly from lower levels while still providing large memory size



## Principle of Locality

- 2 dimensions of this principle: space & time
- Spatial Locality Future accesses will likely cluster near current accesses
  - Instructions and data arrays are sequential (they are all one after the next)
- Temporal Locality Recent accesses will likely be accessed again soon
  - Same code and data are repeatedly accessed (loops, subroutines, etc.)
  - 90/10 rule: Analysis shows that usually 10% of the written instructions account for 90% of the executed instructions

#### Cache and Locality

- Caches take advantage of locality:
- Spatial Locality
  - Caches do not store individual words but blocks of words (a.k.a. "cache line")
  - Caches always bring in a group of sequential words because if we access one, we are likely to access the next
  - Bringing in blocks of sequential words takes advantage of memory architecture (i.e. FPM, SDRAM, etc.)
- Temporal Locality
  - Leave data in the cache because it will likely be accessed again

#### Cache Blocks (Lines)

- Cache is broken into "blocks" or "lines"
  - Any time data is brought in, it will bring in the entire block of data
  - Blocks start on addresses multiples of their size
  - Usually block size matches memory burst width



#### How cache is used?

- Whenever the processor generates a read or a write, it will first check the cache memory to see if it contains the desired data
  - If so, it can get the data quickly from cache
  - Otherwise, it must go to the slow main memory to get the data





#### Cache Definitions

- Cache Hit: if requested data is present in cache
  - Read Hit: Desired read data resides in cache
  - Write Hit: Desired write data resides in cache
- Cache Miss: if requested data is not present in cache
  - Read Miss: Desired read data does not reside in cache
  - Write Miss: Desired write data does not reside in cache

#### Read-Miss Policies

- When desired read data is not in the cache, the entire block must be read in to the cache
  - There are different policies for when the desired read data is returned to the processor
- Early Restart (Load-Through): As a block is being fetched the desired word will be forwarded to processor as soon as it is loaded into the cache
  - Critical word first: Variant of early-restart where the desired word in the block is brought in from memory first, then the rest of the block (i.e. if the 2<sup>nd</sup> word in an 4word block is desired, the memory will supply word 2 then 3, 0, 1...using modulo addressing)
- No Early- Restart (No-Load-Through): Entire block is fetched before returning the desired word to the processor

#### Write Miss Policies

- When desired write data is not in cache, what should be done?
- Fetch-on-Miss (Write-Allocate): Bring block from MM to cache and then perform write w/ given write-hit policy (chances are you will read this data soon so bring in the block)
- Write Around (No Write-Allocate): Just update the word in MM and do not bring the block into cache (write around the cache going straight to MM)

#### Write (Hit) Policies

- If desired write data is already in cache or once it is brought in after a miss, what should be done?
- Write-Through: Update the desired word in MM as well as the cached version
  - This way the MM and cache versions are always "consistent" (in sync.)
- Write-Back: Update the cached version of the word and NOT the MM copy. Requires that the block be written back to main memory when it is evicted from cache
- NOTE: For our class we will assume that <u>Write-</u>
   <u>Through and No-Write Allocate</u> are always used together and that <u>Write-Back and Write-Allocate</u> are used together

#### Replacement Policies

- On a read- or write-miss, a new block must be brought in
- This requires evicting a current block residing in the cache
- Replacement policies
  - FIFO: First-in first-out (oldest block replaced)
  - LRU: Least recently used (usually best but hard to implement)
  - · Random: Actually performs surprisingly well!

#### Blocking vs. Non-blocking Cache

- A cache has an interface between itself and the processor and itself and main memory. Can the cache satisfy requests from the processor at the same time it is bringing in a block from MM?
- Blocking Cache: A blocking cache cannot satisfy other requests from the processor while it is fetching or writing blocks from/to main memory
- Non-blocking Cache: A non-blocking cache can satisfy processor requests at the same time it accesses main memory
  - Example: In an out-of-order, superscalar processor an instruction might cause a miss and stall but the processor will continue executing the following instructions

#### Implementation Terminology

- What bookkeeping values must be stored with the cache in addition to the block data?
  - Valid bit Indicates the block is occupied with valid data (i.e. not empty or invalid)
  - Dirty bit Indicates the cache and MM copies are "inconsistent" (i.e. a write has been done to the cached copy but not the main memory copy)
    - Used for write-back caches
  - Tag Portion of the block's address range used to identify the MM block residing in the cache from other MM blocks

#### Cache Memory

#### Example code snippet:

```
.text
addi $t0,$0,20

LOOP: addi $t0,$t0,-1
bne $t0,$zero,LOOP
lui $t1,0x1001
ori $t1,$t1,0x0004
sw $t0,0($t1)
```

#### **Assumptions:**

- 1) Cache access requires 10ns
- 2) MM access requires 100ns
- 3) Processor time is negligible

| MM   |          |
|------|----------|
| addi | 0x400000 |
| addi | 0x400004 |
| bne  | 0x400008 |
| lui  | 0x40000c |
| ori  | 0x400010 |
| sw   | 0x400014 |
| ?    | 0x400018 |
| ?    | 0x40001c |

# Early-Restart (Load-Through) Cache

 Early-Restart – cache returns the requested data to the processor as soon as it gets it



1 MM Block w/8 words

|      | _        |
|------|----------|
| addi | 0x400000 |
| addi | 0x400004 |
| bne  | 0x400008 |
| lui  | 0x40000c |
| ori  | 0x400010 |
| sw   | 0x400014 |
| ?    | 0x400018 |
| ?    | 0x40001c |

Time: 0 ns

 Early-Restart - cache returns the requested data to the processor as soon as it gets it



 Early-Restart - cache returns the requested data to the processor as soon as it gets it



Time: 110 ns Processor can continue

 Early-Restart - cache returns the requested data to the processor as soon as it gets it

Proc.

| <b>Cache Block</b> |                               | MM Block |          |
|--------------------|-------------------------------|----------|----------|
| addi               | add: @ 200 ma                 | addi     | 0x400000 |
| addi               | addi @ 200 ns<br>bne @ 300 ns | addi     | 0x400004 |
| bne                | lui @ 400 ns                  | bne      | 0x400008 |
| lui                | ori @ 500 ns                  | lui      | 0x40000c |
| ori                | sw @ 600 ns                   | ori      | 0x400010 |
| sw                 | ? @ 700 ns                    | sw       | 0x400014 |
| ?                  | ? @ 800 ns                    | ?        | 0x400018 |
| ?                  | . @ 000 115                   | ?        | 0x40001c |

**Time: 800 ns** 

Each successive word in the block is loaded after 100 ns (in reality this should be less due to sequential burst ability of memory)

 After 1<sup>st</sup> word is returned processor could request 2<sup>nd</sup> word...cache is already fetching



**Time: 110 ns** 

Word is returned as soon as cache receives
 it



**Time: 210 ns** 

Processor immediately requests next word



**Time: 210 ns** 

 Returned to processor as soon as cache receives it .text
addi \$t0,\$0,20

LOOP: addi \$t0,\$t0,-1
bne \$t0,\$zero,LOOP
lui \$t1,0x1001
ori \$t1,\$t1,0x0004
sw \$t0,0(\$t1)



Time: 310 ns Processor can continue

 Now due to loop, processor accesses word already in cache, it can get it in only the cache hit time (10 ns)



 addi
 0x400000

 addi
 0x400004

 bne
 0x400008

 lui
 0x40000c

 ori
 0x400010

 sw
 0x400014

 ?
 0x40001c

**MM Block** 

Time: 310 ns Processor waits

• If processor accesses word already in cache, it can get it in only the cache hit time (10 ns)



MM Block

| addi | 0x400000 |
|------|----------|
| addi | 0x400004 |
| bne  | 0x400008 |
| lui  | 0x40000c |
| ori  | 0x400010 |
| sw   | 0x400014 |
| ?    | 0x400018 |
| ?    | 0x40001c |

Time: 320 ns Processor can continue

• If processor requests a word in another block it must wait for the remainder of the current block



Time: 320 ns
Processor waits until 800 ns for cache to finish current block and keeps waiting until desired word is brought in from next block

#### Critical Word First Example

• If processor requests a word in the middle of the block, MM will return that word first and then the subsequent words, wrapping around the block address (i.e. addr n mod block\_size)



Time: 110 ns Processor waits

## Early-Restart (Load-Through) Summary

- As soon as the cache gets the requested word it will return it to the processor and the processor can continue while the cache brings in the rest of the block
- If the processor requests a new word that the cache does not have yet, it must wait until the cache gets that word (the cache may be busy bringing in the rest of the block)
- Critical word first starts with the desired word and brings in the rest of the block after that

# No Early-Restart (No Load-Through) Cache

 No Early-Restart - cache fetches entire block before returning desired word to processor



| 1 | VIIVI DIUCK |          |
|---|-------------|----------|
|   | addi        | 0x400000 |
|   | addi        | 0x400004 |
|   | bne         | 0x400008 |
|   | lui         | 0x40000c |
|   | ori         | 0x400010 |
|   | SW          | 0x400014 |
|   | ?           | 0x400018 |
|   | ?           | 0x40001c |

MM Rlock

Time: 0 ns

 No Early-Restart - cache fetches entire block before returning desired word to processor



 No Early-Restart - cache fetches entire block before returning desired word to processor



Time: 0 ns

Entire block is brought in while processor waits

 No Early-Restart - cache fetches entire block before returning desired word to processor



Time: 810 ns

addi is returned after entire block is in cache

#### **MM Block**

| addi | 0x400000 |
|------|----------|
| addi | 0x400004 |
| bne  | 0x400008 |
| lui  | 0x40000c |
| ori  | 0x400010 |
| sw   | 0x400014 |
| ?    | 0x400018 |
| ?    | 0x40001c |

Processor can request next word...already in the cache



addi returned after 10 ns

**MM Block** 

| addi | 0x400000 |
|------|----------|
| addi | 0x400004 |
| bne  | 0x400008 |
| lui  | 0x40000c |
| ori  | 0x400010 |
| sw   | 0x400014 |
| ?    | 0x400018 |
| ?    | 0x40001c |

**Time: 820 ns** 

#### No Early-Restart Summary

- The cache will always bring in the entire block before returning the requested word
- Then each subsequent access to that block can be serviced in the cache access time

## Handling Writes

- Write-Back cache (using Write-Allocate)
- Write-Through cache (using No-Write-Allocate)

- Word is only written in the cache (not MM)
- On replacement, entire block must be written back to MM

 Processor writes data to address 0x10010008 which is in cache



#### **MM Block**

| 10000000 |
|----------|
|          |
| 10000004 |
| 10000008 |
| 1000000c |
| 10000010 |
| 10000014 |
| 10000018 |
| 1000001c |
|          |

Time: 0 ns

 Processor writes data to address 0x10010008 which is in cache



| MM | Block |
|----|-------|
|    |       |

| 0000 0001 | 0x10000000 |
|-----------|------------|
| 0000 0002 | 0x10000004 |
| 0000 0003 | 0x10000008 |
| 0000 0004 | 0x1000000c |
| 0000 0005 | 0x10000010 |
| 0000 0006 | 0x10000014 |
| 0000 0007 | 0x10000018 |
| 0000 0008 | 0x1000001c |
|           |            |

Time: 10 ns

Processor can continue after 10ns it takes to write the cache only

 At some point the entire cache line (block) needs to be written back to MM



Time: 10 ns

It will take 800ns at some point to write the cache line back

- Word is written in both cache and MM
- On replacement, block does not have to be written back since MM already has updated values

Processor writes data to address
 0x10010008 which is in cache



#### **MM Block**

| 0000 0001 | 0x10000000 |
|-----------|------------|
| 0000 0002 | 0x10000004 |
| 0000 0003 | 0x10000008 |
| 0000 0004 | 0x1000000c |
| 0000 0005 | 0x10000010 |
| 0000 0006 | 0x10000014 |
| 0000 0007 | 0x10000018 |
| 0000 0008 | 0x1000001c |

Time: 0 ns

Processor writes data to address
 0x10010008 which is in cache and to MM



**Time: 100 ns** 

 When block is replaced in cache it need not be written back to MM

Cache Block

Proc.

 0000 0001
 0x10000000

 0000 0002
 0x10000004

 1234 5678
 0x10000008

 0000 0004
 0x1000000c

 0000 0005
 0x10000014

 0x10000014

0x10000018

0x1000001c

MM Block

0000 0007

0000 0008

**Time: 100 ns** 

#### Example

 Assume we execute a loop 40 times, and each iteration writes to address 0x10010008

 Iteration 1: Processor writes data to address 0x10010008 which is in cache



#### 0000 0001 0x10000000 0000 0002 0x100000040000 0003 0x10000008 0000 0004 0x1000000c 0000 0005 0x10000010 0000 0006 0x100000140000 0007 0x10000018 0000 0008 0x1000001c

MM Block

Time: 0 ns

Iteration 1 completes after 10 ns.



#### **MM Block**

| 0000 0001 | 0x10000000 |
|-----------|------------|
| 0000 0002 | 0x10000004 |
| 0000 0003 | 0x10000008 |
| 0000 0004 | 0x1000000c |
| 0000 0005 | 0x10000010 |
| 0000 0006 | 0x10000014 |
| 0000 0007 | 0x10000018 |
| 0000 0008 | 0x1000001c |
|           |            |

Processor can continue after 10 ns it takes to write the cache only

Time: 10 ns

#### Iteration 2 completes after 20 ns



#### **MM Block**

| 0000 0001 | 0x10000000 |
|-----------|------------|
| 0000 0002 | 0x10000004 |
| 0000 0003 | 0x10000008 |
| 0000 0004 | 0x1000000c |
| 0000 0005 | 0x10000010 |
| 0000 0006 | 0x10000014 |
| 0000 0007 | 0x10000018 |
| 0000 0008 | 0x1000001c |
|           |            |

Time: 20 ns

Processor can continue after 10 ns it takes to write the cache only

 Iteration 40: Processor writes data to address 0x10010008 which is in cache



#### **MM Block**

| 0000 0001 | 0x10000000 |
|-----------|------------|
| 0000 0002 | 0x10000004 |
| 0000 0003 | 0x10000008 |
| 0000 0004 | 0x1000000c |
| 0000 0005 | 0x10000010 |
| 0000 0006 | 0x10000014 |
| 0000 0007 | 0x10000018 |
| 0000 0008 | 0x1000001c |

**Time: 400 ns** 

 On replacement, the entire block must be written back (8 words \* 100 ns)

Proc.

| C | Cache Block |           | MM Block  |            |
|---|-------------|-----------|-----------|------------|
|   | 0000 0001   | @ 500 ns  | 0000 0001 | 0x10000000 |
|   | 0000 0002   | @ 600 ns  | 0000 0002 | 0x10000004 |
|   |             | @ 700 ns  |           |            |
|   | 5678 FE00   | @ 800 ns  | 0000 0003 | 0x10000008 |
|   | 0000 0004   | @ 900 ns  | 0000 0004 | 0x1000000c |
|   | 0000 0005   |           | 0000 0005 | 0x10000010 |
|   | 0000 0006   | @ 1000 ns | 0000 0006 | 0x10000014 |
|   | 0000 0007   | @ 1100 ns | 0000 0007 | 0x10000018 |
|   |             | @ 1200 ns |           |            |
|   | 0000 0008   |           | 0000 0008 | 0x1000001c |

**Time: 1200 ns** 

 Iteration 1: Processor writes data to address 0x10010008 which is in cache and also in MM



**Time: 100 ns** 

 Iteration 40: Processor writes data to address 0x10010008 which is in cache and also in MM



**Time: 4000 ns** 

 On replacement, the block in cache need not be written back...they are the same

Cache Block

Proc.

MM Block

**Time: 4000 ns** 

## Analysis

- Write-back
  - Best results when repeated accesses to a block outweigh cost of writeback (i.e. writes in loops)
- Write-through
  - Best results when few, isolated accesses (no need for writeback)

## Cache Implementation

- Assume 12-bit addresses and 4-word blocks
- Block addresses will range from xx0-xxF
  - Address can be broken down as follows

A[11:4]

- A[11:4] = identifies block range (i.e. xx0-xxF)
- A[3:2] = selects the 1 of 4 words within the block
- A[1:0] = unused (always access 32-bit word)

A[3:2] A[1:0]



## Cache Implementation

 To identify which MM block resides in each cache block, the tags need to be stored along with the Dirty and Valid bits

| Tog   | -           |             |                          |
|-------|-------------|-------------|--------------------------|
| Tag ~ | 1010 1100   |             | Data of 0xAC0-ACF        |
|       | V=1         | <b>D</b> =0 | (unmodified)             |
|       | 0100        | 0111        | <b>Data of 0x470-47F</b> |
|       | V=1         | D=1         | (modified)               |
|       | 0000        | 0000        | empty                    |
|       | <b>V</b> =0 | <b>D</b> =0 |                          |
|       | 0000        | 0000        | empty                    |
|       | V=0         | D=0         |                          |

#### Content-Addressable Memory

- Cache memory is one form of what is known as "content-addressable" memory
  - This means data can be in any location in memory and does not have one particular address
  - Additional information is saved with the data and is used to "address"/find the desired data (this is the "tag" in this case) via a search on each access

| Tog   |        |             |                          |
|-------|--------|-------------|--------------------------|
| Tag ~ | 1010 1 | 1100        | Data of 0xAC0-ACF        |
|       | V=1    | <b>D</b> =0 | (unmodified)             |
|       | 0100 ( | 0111        | <b>Data of 0x470-47F</b> |
|       | V=1    | <b>D</b> =1 | (modified)               |
|       | 0000   | 0000        | empty                    |
|       | V=0    | <b>D</b> =0 | - 0                      |
|       | 0000   | 0000        | empty                    |
|       | V=0    | D=0         |                          |

#### Tag Comparison

 When caches have many blocks (> 16 or 32) it can be expensive (hardware-wise) to check all tags



# Tag Comparison Example

 Tag portion of desired address is checked against all the tags and qualified with the valid bits to determine a hit



#### Mapping Techniques

- Determines where blocks can be placed in the cache
- By reducing number of possible MM blocks that map to a cache block, hit logic (searches) can be done faster
- 3 Primary Methods
  - Direct Mapping
  - Fully Associative Mapping
  - Set-Associative Mapping

## Mapping techniques

- Example cache w/ only 4 blocks
- MM has many blocks

#### Cache

Cache Block 1

Cache Block 2

Cache Block 3

#### Memory

| Block 0 |
|---------|
| Block 1 |
| Block 2 |
| Block 3 |
| Block 4 |
| Block 5 |
| Block 6 |
| Block 7 |
| Block 8 |

 Any block from memory can be put in any cache block (i.e. no restriction)



## Direct Mapping

- Each block from memory can only be put in one location
- Given n cache blocks,
   MM block i maps to cache block i mod n



#### K-way Set-Associative Mapping

- · Given, s sets, block i of MM maps to set i mod s
- Within the set, block can be put anywhere
- Let k = number of cache blocks in a set = n/s
  - K comparisons required for search
  - Usually given k & need to solve for s



#### Mapping Implementation

- Q: How to implement mapping schemes
  - Direct Mapping
  - Fully Associative Mapping
  - Set-Associative Mapping
- A: By using the addresses themselves

#### **Assumptions:**

- Only access words
- 12-bit MM addresses (4096 Bytes)
- 4-word blocks



## Fully Associative Implementation

- 12-bit address:
  - $4 = 2^2$  words per block => 2 LSB's above A[1:0] used to determine the desired word w/in the block
  - Tag = Block # = Upper bits used to identify the block in the cache



#### Fully Associative Address Scheme

- A[1:0] unused (word access only)
- Word bits = log<sub>2</sub>B bits (B=Block Size)
- Tag = Remaining bits

- Any block from memory can be put in any cache block (i.e. no mapping scheme)
- Completely flexible









# Fully Associative Mapping

 Any block from memory can be put in any cache block (i.e. no mapping)



# Fully Associative Mapping

 Any block from memory can be put in any cache block (i.e. no mapping)



# Fully Associative Mapping

 Any block from memory can be put in any cache block (i.e. no mapping)



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



## Direct Mapping Implementation

#### 12-bit address:

- 4 = 2<sup>2</sup> words per block => 2 bits of address used to determine the desired word w/in the block
- 4 = 2<sup>2</sup> possible blocks => 2 bits to determine block...next 2 bits of address
- Tag = Upper 6 bits used to identify the block in the cache (identifies between block (0,4,8,0xC,0x10, etc.)



#### Direct Mapping Implementation

#### 12-bit address:

- 4 = 2<sup>2</sup> words per block => 2 bits of address used to determine the desired word w/in the block
- 4 = 2<sup>2</sup> possible blocks => 2 bits to determine block...next 2 bits of address
- Tag = Upper 6 bits used to identify the block in the cache (identifies between block (0,4,8,0xC,0x10, etc.)



#### Direct Mapping Address Scheme

- A[1:0] unused
- Word bits = log<sub>2</sub>B bits (B=Block Size)
- Block bits = log<sub>2</sub>n bits (n=# of Cache Blocks)
- Tag = Remaining bits

- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n

| Tag | Cache         |
|-----|---------------|
|     | Cache Block 0 |
|     | Cache Block 1 |
|     | Cache Block 2 |
|     | Cache Block 3 |

# Block 0 Block 1 Block 2 Block 3 Block FC Block FC Block FD Block FE Block FF

- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n





- 12-bit address:
  - 4 = 2<sup>2</sup> words per block => 2 bits of address used to determine the desired word w/in the block
  - 2 = 2¹ sets => next 1 bit used to determine which set
  - Tag = Upper 7 bits used to identify the block in the cache



#### K-way Set Associative Address Scheme

- A[1:0] unused (word access only)
- Word bits = log<sub>2</sub>B bits (B=Block Size)
- Set bits =  $log_2S$  bits
  - S = n/K = # of sets and n = # of cache blocks

Tag = Remaining bits



| Tag     | Set | Wor | d A[1 | :0] |
|---------|-----|-----|-------|-----|
| 0000000 | 0   | 00  | 00    |     |













## Address Mapping Examples

- 16-bit addresses, 1 KB cache,
   8 words/block
- Find address mapping for:
  - Fully Associative
  - Direct Mapping
  - 4-way Set Associative
  - 8-way Set Associative

## Address Mapping Examples - Solution

- First find parameters:
  - B = Block size
  - n = # of Cache blocks
  - S = Sets for 4-way and 8-way
- B is given as 8 words/block = 32 bytes/block
- n depends on cache size and block size
  - n = (1 KB  $\div$  4 bytes/word)  $\div$  8 words/block =  $(2^{10} \div 2^2) \div 2^3 = 2^5 = 32$  blocks in the cache
- 5 for 4-way & 8-way
  - S = n/k = 32/4 = 8 sets
  - S = n/k = 32/8 = 4 sets

# Fully Associative

- A1-A0 = unused (only word accesses)
- $log_28 = 3$  word bits (A4-A2)
- Tag = 11 Upper bits (A15-A5)

**Parameters:** 

B = 8



- A1-A0 = unused (only word accesses)
- $log_28 = 3$  word bits (A4-A2)
- $\log_2 32 = 5 \text{ block bits } (A9-A5)$
- Tag = 6 Upper bits (A15-A10)

#### **Parameters:**

$$B = 8$$
$$n = 32$$



# 4-Way Set Assoc. Mapping

- A1-A0 = unused (only word accesses)
- $log_28 = 3$  word bits (A4-A2)
- $log_28 = 3$  set bits (A7-A5)
- Tag = 8 Upper bits (A15-A8)

#### **Parameters:**

# 8-Way Set Assoc. Mapping

- A1-A0 = unused (only word accesses)
- $log_28 = 3$  word bits (A4-A2)
- $log_24 = 2$  set bits (A6-A5)
- Tag = 9 Upper bits (A15-A7)

#### **Parameters:**

 $\mathbf{B} = \mathbf{8}$  $\mathbf{N} = \mathbf{32}$ 

S8-way = 4

# Cache Operation Example

#### Address Trace

• R: 0x00a0

• W: 0x00f4

• R: 0x00b0

• W: 0x2a2c

#### Operations

- Hit
- Fetch block XX
- Evict block XX(w/ or w/o WB)
- Final WB of block XX)

- Perform address breakdown and apply address trace
- 2-Way Set-Assoc, N=4, B=8 words

| Address | Tag          | Set | Word | Unused |
|---------|--------------|-----|------|--------|
| 0x00a0  | 0000 0000 10 | 1   | 000  | 00     |
| 0x00f4  | 0000 0000 11 | 1   | 101  | 00     |
| 0x00b0  | 0000 0000 10 | 1   | 100  | 00     |
| 0x2a2c  | 0010 1010 00 | 1   | 011  | 00     |

| Processor<br>Access | Cache Operation                                |
|---------------------|------------------------------------------------|
| R: 0x00a0           | Fetch Block 00a0-00bf                          |
| W: 0x00f4           | Fetch Block 00e0-00ff                          |
| R: 0x00b0           | Hit                                            |
| W: 0x2a2c           | Evict 00e0-00ff w/ WB<br>Fetch Block 2a20-2a3f |
| Done!               | Final WB of 2a20-2a3f                          |

#### Levels of Cache

- If one cache is good, 2 or more might be better
- L1 cache is closest to processor with increasing numbers progressing outwards
- L1 is smallest and fastest cache with higher levels being slower but bigger to hold more data
- On a read or write, check L1 cache. On a miss, check L2. If it hits in L2 bring the block into L1.
   If L2 misses then get the data from memory



#### Principle of Inclusion

- When a block is brought in from memory or an upper level, it is brought into all lower levels, not just L1
- This implies that lower levels always contains a subset of higher levels
  - L1 contains most recently used data
  - L2 contains that data + data used earlier
  - MM contains all data



#### Pentium 4 Cache

- L1 Cache Broken into separate caches to hold instructions or data
  - I-Cache (Instruction)
    - 12K Entries
  - D-Cache (Data)
    - 16 KB
    - 4-Way Set Assoc.
- L2 Cache
  - Both I & D together
  - 1-2 MB
  - 8-Way Set Associative



**Pentium 4 Cache** 

#### Pentium 4

#### L2 Cache





L1 Instruc.

#### Average Access Time

- Define parameters
  - H<sub>i</sub> = Hit Rate of Cache Level L<sub>i</sub>
     (Note that 1-H<sub>i</sub> = Miss rate)
  - T<sub>i</sub> = Access time of level i
  - R<sub>i</sub> = Burst rate per word of level i (after startup access time)
  - B = Block Size
- We will find  $T_{AVE}$  = average access time
- Assume No-Load-Through cache
  - Whether hit or miss, you must spend at least
     T1 time to get the value from the L1 cache

#### Tave without L2 cache

- 2 possible cases:
  - Either we have a hit and pay only the L1 cache hit time
  - Or we have a miss and read in the whole block to L1 and then read from L1 to the processor

$$T_{ave} = T_1 + (1-H_1) \cdot [T_{MM} + B \cdot R_{MM}]$$

(Miss Rate)\*(Miss Penalty)

• For 
$$T_1=10ns$$
,  $H_1=0.9$ ,  $B=8$ ,  $T_{MM}=100ns$ ,  $R_{MM}=25ns$ 

• 
$$T_{ave} = 10 + [(0.1) \cdot (100 + 8.25)] = 40 \text{ ns}$$

#### Tave with L2 cache

#### 3 possible cases:

- Either we have a hit and pay the L1 cache hit time
- Or we miss L1 but hit L2 and read in the block from L2
- Or we miss L1 and L2 and read in the block from MM

$$T_{ave} = T_1 + (1-H_1)\cdot H_2\cdot (T_2+B\cdot R_2) + (1-H_1)\cdot (1-H_2)\cdot (T_{MM}+B\cdot R_{MM})$$
L1 miss / L2 Hit

L1 miss / L2 Miss

- For  $T_1 = 10$ ns,  $H_1 = 0.9$ ,  $T_2 = 20$ ns,  $R_2 = 10$ ns,  $H_2 = 0.98$ , B=8,  $T_{MM}=100$ ns,  $R_{MM}=25$  ns
- $T_{ave} = 10 + (0.1) \cdot (.98) \cdot (20 + 8 \cdot 10) + (0.1) \cdot (.02) \cdot (100 + 8 \cdot 25)$ = 10 + 9.8 ns + 0.6 = 20.4 ns

# Cache Configurations

|                       | AMD Opteron     | Intel P4            | PPC 7447a           |
|-----------------------|-----------------|---------------------|---------------------|
| Clock rate (2004)     | 2.0 GHz         | 3.2 GHz             | 1.5 – 2 GHz         |
| Instruction<br>Cache  | 64KB, 2-way SA  | 96 KB               | 32 KB, 8-way SA     |
| Latency (clocks)      | 3               | 4                   | 1                   |
| Data cache            | 64 KB, 2-way SA | 8 KB, 4-way SA      | 32 KB, 8-way SA     |
| Latency (clocks)      | 3               | 2                   | 1                   |
| L1 Write Policy       | Write-back      | Write-through       | Programmable        |
| On-chip L2            | 1 MB, 16-way SA | 512 KB, 8-way<br>SA | 512 KB, 8-way<br>SA |
| L2 Latency            | 6               | 5                   | 9                   |
| Block size<br>(L1/L2) | 64              | 64/128              | 32/64               |
| L2 Write-Policy       | Write-back      | Write-back          | Programmable        |

Sources: H&P, "CO&D", 3<sup>rd</sup> ed., Freescale.com

#### Miss Rate

- Reducing Miss Rate means lower T<sub>AVE</sub>
- To analyze miss rate categorize them based on why they occur
  - Compulsory Misses
    - -First access to a block will always result in a miss
  - Capacity Misses
    - Misses because the cache is too small
  - · Conflict Misses
    - Misses due to mapping scheme (replacement of direct or set associative)

#### Miss Rate & Block Size



Graph used courtesy "Computer Architecture: AQA, 3<sup>rd</sup> ed.", Hennessey and Patterson

# Miss Rate & Associativity



Graph used courtesy "Computer Architecture: AQA, 3<sup>rd</sup> ed.", Hennessey and Patterson

# Prefetching

#### Hardware Prefetching

- On miss of block i, fetch block i and i+1
- Advanced "speculation" / Runahead execution
  - While processor is waiting for a cache block, try to continue executing "future" instructions looking only for memory references and not storing results of instructions

#### Software Prefetching

- Special "Prefetch" Instructions
- Compiler inserts these instructions to give hints ahead of time as to the upcoming access pattern

# Cache-Conscious Programming

- Order of array indexing
  - Row major vs. column major ordering
- Blocking
  - Break matrices into smaller ones which can fit into cache, perform necessary operations on the block, then merge block results
- Pointer-chasing
  - Linked lists, graphs, tree data structures that use pointers do not exhibit good spatial locality





**Original** 

Matrix



**Blocked** 

Matrix